Understanding Model Fitting with Likelihood

Stats
Bayesian
Author

Yiğit Aşık

Published

January 25, 2025

In daily life, “probability” and “likelihood” are used interchangeably, but they don’t refer to the same thing. I believe the most intuitive way to distinguish the two is by thinking about the information you have while answering a question: with probability, the parameters are fixed and we ask how likely different outcomes (data) are; with likelihood, the data are already observed and fixed, and we ask how plausible they are under different values of the parameters.

We can use this second type of approach: let the parameter values vary and calculate the likelihood of the observed data under each value of the parameter(s).
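As a tiny illustration (a coin-flip example of my own, not the possum data we’ll work with below), here is the same binomial calculation viewed both ways:

import scipy.stats as st

# Probability: the parameter is fixed (a fair coin, p = 0.5) and we ask about
# data we haven't observed yet: what's the chance of 7 heads in 10 flips?
print(st.binom.pmf(7, n=10, p=0.5))

# Likelihood: the data are fixed (we observed 7 heads in 10 flips) and we ask
# how plausible that observation is under different candidate values of p.
for p in [0.3, 0.5, 0.7]:
    print(p, st.binom.pmf(7, n=10, p=p))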

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import empiricaldist as emd
import scipy.stats as st

import utils as ut

import warnings

warnings.filterwarnings('ignore')
possum = pd.read_csv('possum.csv', usecols=['pop', 'tail_l'])
df = possum.copy()

df.head()
pop tail_l
0 Vic 36.0
1 Vic 36.5
2 Vic 39.0
3 Vic 38.0
4 Vic 36.0

We’ll use the possum data set, where each row contains the features of one possum from Australia or New Guinea. Let’s attempt to predict which region a possum is from as a function of its tail length. (You can get the data from here)

I assume basic knowledge of logistic regression, which we’ll use to model the relationship between our predictor and the binary target variable.

qs_cept = np.linspace(-3, 3, 241) # possible values for intercept (beta_0)
qs_slope = np.linspace(-1, 1, 81) # possible values for slope (beta_1)
print(qs_cept[:5], '\n', qs_slope[:5])
[-3.    -2.975 -2.95  -2.925 -2.9  ] 
 [-1.    -0.975 -0.95  -0.925 -0.9  ]
df_likeli = pd.DataFrame(index=range(len(qs_cept)), columns=range(len(qs_slope)))
df_likeli.index = qs_cept
df_likeli.columns = qs_slope
df_likeli.fillna(1, inplace=True)

df_likeli.head()
-1.000 -0.975 -0.950 -0.925 -0.900 -0.875 -0.850 -0.825 -0.800 -0.775 ... 0.775 0.800 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000
-3.000 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
-2.975 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
-2.950 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
-2.925 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
-2.900 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1

5 rows × 81 columns

So, what I have done is create a grid of equally spaced values for both the intercept and the slope. I put them into a DataFrame where the columns represent possible values for the slope and the index represents possible values for the intercept.

Now, I ask this question: for each pair of intercept and slope, \(\beta_0\) and \(\beta_1\), how likely am I to see the observed data?

df['pop'] = df['pop'].apply(lambda x: 1 if x == 'Vic' else 0)

# Centering the predictor by subtracting the (rounded) mean tail length
offset = df['tail_l'].mean().round()
df['x'] = df['tail_l'] - offset
df['y'] = df['pop']

# I refer to our predictor as x from now on (for convenience); y is our target
# variable, which takes the value 1 if the possum is from the Victoria region
# and 0 otherwise.
agg_data = df.groupby('x')['y'].agg(['sum', 'count'])
agg_data.head(10).T
x -5.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5
sum 2 1 4 2 6 6 9 4 4 1
count 2 1 5 2 9 7 13 12 6 4
ns = agg_data['count'] # number of observations with the corresponding x value
ks = agg_data['sum']   # number of successes, i.e., possums from the Victoria region

People are understandably confused when they hear logistic regression described as a linear model, since all they usually see is an S-shaped curve in the classic graph where the x-axis represents the predictor and the y-axis represents the probability. But that squiggle is the result of transforming log odds into probabilities; the model is linear on the log-odds scale. You may have deduced that already if you have taken a look at the equation:

\(\displaystyle \log_e\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}\)

However, we know that log odds can be converted back into probabilities (exponentiate to get odds, then divide the odds by one plus the odds), so we’ll make that transformation.
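Concretely, solving the log-odds equation above for \(p_i\) (written here for a single predictor, which is our case) gives the inverse logit, also known as the expit function:

\(\displaystyle p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\)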

To illustrate, let’s select an intercept and a slope:

cept = 0
slope = 1
x_values = agg_data.index
log_odds = cept + slope * x_values
odds = np.exp(log_odds)
ps = odds / (odds + 1)
ps[:6] # Probabilities coming from the model
Index([0.006692850924284856, 0.029312230751356316,  0.04742587317756679,
        0.07585818002124356,  0.11920292202211755,  0.18242552380635632],
      dtype='float64', name='x')

How likely am I to observe \(k\) successes in \(n\) trials with probability \(p_i\), where \(p_i\) comes from the model above?

# How likely am I to observe k successes in n trials with p_i, where p_i comes from the model
likelis = st.binom.pmf(ks, ns, ps)
likelis[:6]
array([4.47942535e-05, 2.93122308e-02, 2.40951774e-05, 5.75446348e-03,
       1.64675234e-04, 2.10930303e-04])
# Taking the product over the whole data set
likelis.prod()
6.58621661515704e-55
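A side note (not strictly needed for this small data set): products of many small likelihoods shrink toward zero very quickly and can underflow, so in practice one usually sums log-likelihoods instead. A minimal sketch of the same computation on the log scale:

# Same quantity on the log scale: a sum of log-likelihoods instead of a product
log_likeli = st.binom.logpmf(ks, ns, ps).sum()
log_likeli, np.exp(log_likeli)  # exponentiating recovers the tiny product above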

Let’s take a look at the fit:

plt.figure(figsize=(6,4))

plt.plot(x_values+offset, ps, label='arbitrarily selected model', color='C1')
plt.scatter(df['x']+offset, df['y'], s=30, label='data')

ut.decorate()

This was for only one pair of intercept and slope, and it seems like we can do better. We calculated the likelihood here for educational purposes, but in general, likelihoods by themselves do not mean much (I’m trying to choose my words carefully here); what matters is comparing them across parameter values. Let’s try the other possible pairs of parameter values, compare their likelihoods, and take the pair that maximizes the likelihood of the observed data.

from scipy.special import expit

likelihood = df_likeli.copy()

for cept in likelihood.index:
    for slope in likelihood.columns:
        probs = expit(cept + slope * x_values)       # transform log odds to probabilities
        likelis = st.binom.pmf(ks, ns, probs)        # likelihood of each observation
        likelihood.loc[cept, slope] = likelis.prod() # likelihood of the whole data set under this pair of parameter values
row, col = likelihood.stack().idxmax()
print(row, col)
-0.32499999999999973 -0.7

This is the pair that maximizes the likelihood over the grid we searched.
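To get a feel for what the grid search found, here is a quick sketch (my own addition, just a matplotlib contour plot) of the likelihood surface over the grid, with the winning pair marked:

# Visualize the likelihood over the grid; the peak is the pair we just found
X, Y = np.meshgrid(qs_slope, qs_cept)  # columns are slopes, index is intercepts
Z = likelihood.values.astype(float)

plt.figure(figsize=(6, 4))
plt.contourf(X, Y, Z, levels=20)
plt.colorbar(label='likelihood of the data')
plt.scatter([col], [row], color='C1', label='grid maximum')
plt.xlabel('slope')
plt.ylabel('intercept')
plt.legend()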

cept = row
slope = col

log_odds = cept + slope * x_values
odds = np.exp(log_odds)
ps = odds / (odds + 1)
plt.figure(figsize=(6,4))

plt.plot(x_values+offset, ps, label='model', color='C1')
plt.scatter(df['x']+offset, df['y'], s=30, label='data')

ut.decorate()

Let’s check our parameter estimates with the help of statsmodels:

import statsmodels.formula.api as smf

formula = 'y ~ x'
results = smf.logit(formula, data=df).fit(disp=False)
results.params
Intercept   -0.320498
x           -0.699641
dtype: float64

We didn’t get those exact values since we used grid approximation, and the grid I created didn’t include them. A couple of details worth noting: a finer (or wider) grid would bring us closer, and maximizing the likelihood is exactly what statsmodels is doing under the hood, just with a numerical optimizer instead of a grid.
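As a rough sketch of that last point (my own illustration, not part of the original fit), we could hand the negative log-likelihood to a general-purpose optimizer and let it search over continuous parameter values instead of a grid:

from scipy.optimize import minimize

def neg_log_likelihood(params):
    # Same likelihood as in the grid search, on the log scale and negated
    cept, slope = params
    probs = expit(cept + slope * x_values)
    return -st.binom.logpmf(ks, ns, probs).sum()

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
res.x  # should land very close to the statsmodels estimates above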

I hope this example made likelihoods more intuitive. There’s more to know about them: are they probability densities? Are they probability masses? Or are they both? That’s for a whole different post; things can get messy in a second. However, it’s important to understand likelihoods in the context above, since they can be used to update our beliefs about parameters, which is what we do with Bayesian methods.

As usual, have a nice weekend :)